Controlling Mixture Component Overlap for Clustering Algorithms Evaluation

Author

  • D. Ziou
Abstract

This paper presents two algorithms for generating artificial mixtures of univariate normal densities with controlled component overlap. Such mixtures serve mainly as benchmarks for evaluating clustering algorithms. Although the two algorithms differ, they rest on the same definition of overlap. Both use formal methods to ensure that the generated mixtures are “non totally overlapped”, in contrast to algorithms in the literature, which generally control the overlap by ad hoc methods. For this purpose, overlap is assumed to affect only two adjacent components of the mixture. Three configurations are defined: the “maximum overlap”, beyond which two components are considered totally merged; the “minimum overlap”, beyond which two components are considered “nonoverlapped”; and the “rate of overlap”, denoted by λ (0 < λ ≤ 1), which specifies a given overlap between the maximum and the minimum. Both algorithms are designed to generate mixtures that respect these overlap configurations. The first algorithm uses the width of the components to control the overlap, while the second uses the mean. The first gives better control over the boundaries of the generated mixture but involves complex relations; it has been used to generate unidimensional mixtures that approximate image histograms. The second involves less complex relations and allows one to control the rate of overlap. The reason for controlling the rate of overlap is to generate artificial mixtures of a given degree of difficulty: the greater the rate of overlap, the more complex the mixture. Both algorithms can generate mixtures with any number of components. Received November 2, 2001.
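The idea of controlling overlap through the component means can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the formal definitions of maximum and minimum overlap, so the sketch assumes, purely for illustration, that adjacent means are spaced `merged_k * (σ_i + σ_{i+1})` apart at maximum overlap and `separated_k * (σ_i + σ_{i+1})` apart at minimum overlap, and that λ interpolates linearly between the two spacings. The function names `mixture_means` and `sample_mixture` and both multipliers are hypothetical.

```python
import random


def mixture_means(sigmas, lam, first_mean=0.0, merged_k=1.0, separated_k=3.0):
    """Place component means so that adjacent gaps reflect the overlap rate lam.

    Assumption (not from the paper): the gap between adjacent means runs
    linearly from separated_k * (s_i + s_j) as lam approaches 0 down to
    merged_k * (s_i + s_j) at lam = 1.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    means = [first_mean]
    for left, right in zip(sigmas, sigmas[1:]):
        width = left + right
        gap_max_overlap = merged_k * width     # assumed spacing at maximum overlap
        gap_min_overlap = separated_k * width  # assumed spacing at minimum overlap
        gap = gap_min_overlap - lam * (gap_min_overlap - gap_max_overlap)
        means.append(means[-1] + gap)
    return means


def sample_mixture(means, sigmas, weights, n, seed=0):
    """Draw n points from the univariate normal mixture."""
    rng = random.Random(seed)
    # Pick a component for each point, then sample from that component.
    labels = rng.choices(range(len(means)), weights=weights, k=n)
    return [rng.gauss(means[i], sigmas[i]) for i in labels]


if __name__ == "__main__":
    sigmas = [1.0, 1.0, 2.0]
    means = mixture_means(sigmas, lam=0.5)
    data = sample_mixture(means, sigmas, weights=[1, 1, 1], n=1000)
    print(means, len(data))
```

Under these assumptions a larger λ pulls adjacent means closer together, which matches the abstract's statement that a greater rate of overlap yields a more complex mixture. Any number of components can be generated by extending the `sigmas` list.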

Similar resources

MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms

The R package MixSim is a new tool that allows simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as a sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datase...
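As a rough illustration of pairwise overlap defined as a sum of two misclassification probabilities, the following Python sketch handles only the simplest case: two univariate normal components with equal variance and equal mixing weights. In that case the decision boundary is the midpoint between the means and each misclassification probability equals Φ(−|μ₁ − μ₂| / (2σ)). This is not MixSim's implementation, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_overlap_equal(mu1, mu2, sigma):
    """Sum of the two misclassification probabilities.

    Assumes equal variances and equal mixing weights, so the boundary
    between the two components is the midpoint of the means.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    half_gap = abs(mu1 - mu2) / (2.0 * sigma)
    miss = normal_cdf(-half_gap)  # probability of crossing the midpoint
    return 2.0 * miss


if __name__ == "__main__":
    for gap in (0.0, 2.0, 6.0):
        print(gap, pairwise_overlap_equal(0.0, gap, 1.0))
```

Under these assumptions the overlap is 1 when the means coincide and shrinks toward 0 as the means move apart.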

Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms

A new method is proposed to generate sample Gaussian mixture distributions according to pre-specified overlap characteristics. Such methodology is useful in the context of evaluating performance of clustering algorithms. Our suggested approach involves derivation of and calculation of the exact overlap between every cluster pair, measured in terms of their total probability of misclassification...

The Geometry of Kernelized Spectral Clustering

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture...

Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...

Publication date: 2002